CRUXEval-output

[Figures: pairwise wins; p-values; delta vs. p-values]

result table

rank  model                       pass@1  win_rate       elo
   0  gpt-4-turbo-2024-04-09+cot   0.820     0.928  1544.180
   1  gpt-4-0613+cot               0.771     0.920  1519.130
   2  claude-3-opus-20240229+cot   0.820     0.862  1404.876
   3  gpt-3.5-turbo-0613+cot       0.590     0.801  1323.343
   4  gpt-4-0613                   0.687     0.735  1235.863
   5  codellama-34b+cot            0.436     0.723  1225.902
   6  gpt-4-turbo-2024-04-09       0.677     0.715  1215.286
   7  claude-3-opus-20240229       0.657     0.665  1163.230
   8  codellama-13b+cot            0.360     0.656  1162.080
   9  codellama-7b+cot             0.299     0.556  1070.951
  10  deepseek-base-33b            0.486     0.501  1019.993
  11  deepseek-instruct-33b        0.499     0.501  1019.080
  12  gpt-3.5-turbo-0613           0.494     0.475  1000.000
  13  codetulu-2-34b               0.458     0.463   987.470
  14  deepseek-base-6.7b           0.435     0.446   976.538
  15  magicoder-ds-7b              0.444     0.429   960.910
  16  codellama-34b                0.424     0.410   945.742
  17  mixtral-8x7b                 0.405     0.409   946.291
  18  codellama-13b                0.397     0.381   927.468
  19  wizard-34b                   0.434     0.360   906.376
  20  wizard-13b                   0.413     0.357   904.437
  21  codellama-python-34b         0.414     0.348   897.732
  22  codellama-python-13b         0.398     0.343   893.673
  23  deepseek-instruct-6.7b       0.412     0.318   873.682
  24  phind                        0.397     0.306   864.398
  25  phi-2                        0.335     0.292   849.806
  26  codellama-python-7b          0.359     0.290   848.583
  27  mistral-7b                   0.343     0.273   833.922
  28  codellama-7b                 0.342     0.270   833.731
  29  starcoderbase-16b            0.342     0.268   828.742
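The section labels suggest the pipeline behind the elo column: count pairwise wins, test them for significance, then fit ratings. As a hedged sketch (not necessarily the computation actually used here), a common way to turn a pairwise win matrix into Elo-scale ratings is a Bradley-Terry fit with one model pinned to a reference rating; the fact that gpt-3.5-turbo-0613 sits at exactly 1000.000 hints that it was the anchor, but that is an assumption:

```python
import math

def fit_bradley_terry(wins, n_iter=200):
    """wins[i][j] = number of comparisons in which model i beat model j.
    Returns Bradley-Terry strengths via the standard MM updates:
    p_i <- (total wins of i) / sum_j (games between i and j) / (p_i + p_j)."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            num = sum(wins[i][j] for j in range(n) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            new_p.append(num / den if den else p[i])
        total = sum(new_p)
        p = [x * n / total for x in new_p]  # normalize to fix the scale
    return p

def to_elo(strengths, anchor_idx, anchor_elo=1000.0):
    """Map Bradley-Terry strengths onto the Elo scale (400 points per
    factor of 10 in strength), pinning one model's rating to anchor_elo."""
    base = strengths[anchor_idx]
    return [anchor_elo + 400.0 * math.log10(s / base) for s in strengths]

# Toy usage with three hypothetical models; model 0 wins most comparisons.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
elo = to_elo(fit_bradley_terry(wins), anchor_idx=2)
```

In this toy example the ratings come out ordered `elo[0] > elo[1] > elo[2]`, with the anchor model at exactly 1000. The p-value plots above would then flag which adjacent pairs in the result table differ by more than noise.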